1. Introduction

The aim of these Lectures is to provide an overview of the statistical tools that are
currently used to study the large-scale distribution of cosmic structures and
that go beyond the simple (although useful!) two-point correlation function. The
reason why we need such "higher-order" information lies essentially in the fact that
two-point correlations exhaust the statistical content of a system only if it
is Gaussian. On the other hand, even allowing for random-phase initial
conditions, there are good reasons to expect that the present-day distribution of
galaxies and galaxy clusters has non-Gaussian features, which are rooted in the
dynamical history of their formation and evolution.

It is clear that a statistical description of the large-scale texture of the Universe
that is as complete as possible is not required merely to provide a cosmographical
description, i.e. to understand whether galaxies are preferentially located in
clusters or in filaments, or how empty they leave the underdense parts of
their distribution. Rather, this information is a necessary ingredient for solving the
dynamical problem of cosmic structure formation: once an assumption is made for
the underlying dynamics governing the evolution (e.g., gravitational instability), and
with the amplitude of large-scale fluctuations (almost) fixed by COBE, one
hopes that the statistical knowledge of the present-day galaxy distribution
can be uniquely related to the nature of primordial fluctuations and, possibly,
to the dark matter content of the Universe.

Having this in mind, it is clear that the characteristics we would desire for 
statistical descriptors are the following (see also ref.[64]): 

(a) to be robust, that is to be able to provide statistically significant results even 
when dealing, as usual, with rather limited data sets; 

(b) to be discriminative, so as to pick up significant differences when applied to 
different dark-matter models; 

(c) to be interpretable, so that the statistical information it provides can be easily 
connected to dynamical and physical quantities; 

(d) to be assumption-free, so that the results it provides are not sensitive to the
way of identifying galaxies and galaxy clusters in collisionless simulations.

Since the first attempts to describe galaxy clustering statistically, which date
back about 25 years, many such methods have been proposed and applied.
Describing all of them would go far beyond the scope of these
Lectures. Therefore, in the following I will mainly deal with the methods
that have been most commonly used. Furthermore, instead of providing technical
details about their implementation in practical applications, I will discuss what kind
of information they provide and what we have learned from them so far. Readers
interested in specific aspects, or in other statistical methods,
are referred to refs.[8, 69] for recent comprehensive reviews, as well as to the relevant
literature quoted therein.

In more detail, I will concentrate on correlation analysis methods and related
issues, like count-in-cell statistics and the probability density function. This choice is
motivated not just by historical reasons (correlation functions were the first
quantities to be measured in extended galaxy samples), but mainly by the
fact that this type of analysis is still the one most commonly applied to
both real data sets and numerical simulations. In this context, Section 2 introduces
the statistical formalism, while Section 3 is devoted to a brief description of the
main sources of uncertainty in statistical analyses. Section 4 describes the results
of correlation analysis of observational samples and their interpretation. Section 5
deals with geometrical descriptions of the large-scale clustering, like those provided
by the void probability function and topological characteristics. Final comments are
reserved for Section 5.

2. The correlation statistics 

The classical correlation analysis of the galaxy distribution was based on the
determination of the 2-point correlation function, $\xi(r)$. Its definition is related to
the joint probability

$$ \delta^{(2)}P = n^2\,\delta V_1\,\delta V_2\,[1 + \xi(r_{12})] \qquad (1) $$

of finding an object in the volume element $\delta V_1$ and another one in $\delta V_2$, at separation
$r_{12}$. In eq.(1), factoring out the $n^2$ term ($n$ being the mean galaxy number
density) makes $\xi(r)$ a dimensionless quantity, and the total probability turns out to be
normalized to the square of the total number of objects in the distribution. According
to its definition, the value of the correlation function is a measure of the non-random
behaviour of the distribution and, for isotropic clustering, depends only on the
modulus of the separation vector $\mathbf{r}_{12}$. In particular, object positions are said to
be correlated if $\xi(r) > 0$ and anticorrelated if $-1 < \xi(r) < 0$, while a Poissonian
distribution is characterized by $\xi(r) = 0$ at any separation.

The concept of correlation functions can be extended to higher orders, by 
considering the joint probabilities between more than two points. In the following 
I will introduce the concept of correlations of generic order for a given density field. 

2.1. Correlation functions 

Let us consider a generic density field, $\rho(\mathbf{x})$, and the relative fluctuations, $\delta(\mathbf{x}) =
[\rho(\mathbf{x}) - \bar\rho]/\bar\rho$, around the average density $\bar\rho$. By definition, $\langle\delta(\mathbf{x})\rangle = 0$, while the
requirement of a positive-definite $\rho(\mathbf{x})$ leads to $\delta(\mathbf{x}) > -1$. In the following, $\delta(\mathbf{x})$ is
assumed to be described by a random function, so that the Universe can be considered
as a particular realization taken from an ensemble (functional space) $\mathcal{T}$ containing all
the $\delta(\mathbf{x})$ fields satisfying the above two requirements.

In order to describe the statistics of the $\delta(\mathbf{x})$ field, let $\mathcal{P}[\delta(\mathbf{x})]$ be the probability
that the density fluctuations are described by a given $\delta(\mathbf{x}) \in \mathcal{T}$. With the
assumption of statistical homogeneity, the probability functional $\mathcal{P}[\delta(\mathbf{x})]$ turns out
to be independent of the position $\mathbf{x}$, while, due to the requirement of isotropic
clustering, the joint distribution of $\delta(\mathbf{x}_1)$ and $\delta(\mathbf{x}_2)$ depends only on the separation
$r_{12} = |\mathbf{x}_1 - \mathbf{x}_2|$. By definition, the probability distribution in the functional space
must be normalized so that the total probability is unity: $\int_{\mathcal{T}} \mathcal{D}[\delta(\mathbf{x})]\,\mathcal{P}[\delta(\mathbf{x})] = 1$.
Here $\mathcal{D}[\delta(\mathbf{x})]$ represents a suitable measure introduced in $\mathcal{T}$ in order to define the
functional integral.

Let us consider the partition functional

$$ \mathcal{M}[t(\mathbf{x})] = \int \mathcal{D}[\delta(\mathbf{x})]\,\mathcal{P}[\delta(\mathbf{x})]\,e^{\int d\mathbf{x}\,\delta(\mathbf{x})\,t(\mathbf{x})} = \left\langle e^{\int d\mathbf{x}\,\delta(\mathbf{x})\,t(\mathbf{x})} \right\rangle , \qquad (2) $$

where $t(\mathbf{x})$ is a generic function that plays the role of an external source perturbing
the underlying statistics. A complete characterization of the statistics of the density
distribution can be given in terms of the $n$-point correlation functions

$$ \mu_n(\mathbf{x}_1,\dots,\mathbf{x}_n) = \langle\delta(\mathbf{x}_1)\dots\delta(\mathbf{x}_n)\rangle = \left.\frac{\delta^n \mathcal{M}[t]}{\delta t(\mathbf{x}_1)\dots\delta t(\mathbf{x}_n)}\right|_{t=0} , \qquad (3) $$

with $\mu_1(\mathbf{x}) = \langle\delta(\mathbf{x})\rangle = 0$. The notation $\langle\,\cdot\,\rangle$ indicates the average over the $\mathcal{T}$
space, while $\delta/\delta t(\mathbf{x})$ stands for the functional derivative with respect to $t(\mathbf{x}) \in \mathcal{T}$
(see Appendix for the meaning of the functional derivative). Eq.(3) represents
the statistical-mechanical equivalent of the path-integral definition of the Green's
functions in quantum field theory (see, e.g., ref.[62]). Under the assumption of
ergodicity of our system, the averages taken over the (physical) configuration space
are completely equivalent to the expectations taken over an ensemble of universes, i.e.
over the functional space $\mathcal{T}$. From now on I will use the symbol $\langle\,\cdot\,\rangle$ interchangeably to
indicate both kinds of average.

A further characterization of $\delta(\mathbf{x})$ is also given in terms of the connected or
irreducible correlation functions, $\kappa_n(\mathbf{x}_1,\dots,\mathbf{x}_n)$. Such quantities are introduced
through their generating functional

$$ \mathcal{K}[t(\mathbf{x})] = \ln \mathcal{M}[t(\mathbf{x})] , \qquad (4) $$

so that

$$ \kappa_n(\mathbf{x}_1,\dots,\mathbf{x}_n) = \left.\frac{\delta^n \mathcal{K}[t]}{\delta t(\mathbf{x}_1)\dots\delta t(\mathbf{x}_n)}\right|_{t=0} . \qquad (5) $$

Therefore, a unique characterization of the statistics, i.e. the knowledge of the
partition functional, requires that correlation functions of any order are known.

For $n = 2$, it is easy to show that the definition (3) of correlation function is
completely equivalent to that provided by eq.(1). In fact, the 2-point joint probability
of having the density values $\rho(\mathbf{x}_1)$ at the position $\mathbf{x}_1$ and $\rho(\mathbf{x}_2)$ at $\mathbf{x}_2$ is $\langle\rho(\mathbf{x}_1)\rho(\mathbf{x}_2)\rangle =
\bar\rho^2\,[1 + \mu_{2,12}]$, which coincides with eq.(1), once we take $\xi(r_{12}) = \mu_2(r_{12})$.

In order to study the structure of the 3-point correlation function, let us suppose
that the point $\mathbf{x}_3$ is sufficiently far away from $\mathbf{x}_1$ and $\mathbf{x}_2$, so that the event probability
at $\mathbf{x}_3$ does not depend on that at the other two points. If this is the case, the 3-point
joint probability is

$$ \langle\rho_1\rho_2\rho_3\rangle = \langle\rho_1\rho_2\rangle \times \bar\rho , \qquad (6) $$

where the meaning of the indices is obvious. Hence, requiring symmetry under the
exchange of $\mathbf{x}_3$ with $\mathbf{x}_1$ and with $\mathbf{x}_2$, the 3-point probability can be cast in the form

$$ \langle\rho_1\rho_2\rho_3\rangle = \bar\rho^3\,[1 + \xi_{12} + \xi_{13} + \xi_{23} + \zeta_{123}] . \qquad (7) $$

Here, $\zeta \equiv \kappa_3$ is the term that correlates the three points all together and must vanish
when one of these points is removed:

$$ \zeta(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_l \to \infty) = 0 ; \qquad i \ne j \ne l ; \quad i,j,l = 1,2,3 . \qquad (8) $$

On the basis of similar considerations, the 4-point joint probability is written as

$$ \langle\rho_1\rho_2\rho_3\rho_4\rangle = \bar\rho^4\,\{1 + [\xi_{12} + \dots\ 6\ \text{terms}] + [\zeta_{123} + \dots\ 4\ \text{terms}] + \mu_4\} . \qquad (9) $$

Here the 4-point correlation function

$$ \mu_{4,1234} = \xi_{12}\xi_{34} + \xi_{23}\xi_{14} + \xi_{13}\xi_{24} + \eta_{1234} \qquad (10) $$

represents the term connecting the four points and gives a vanishing contribution
when at least one point is moved to infinite separation from the others. The $\mu_4$
term contains three terms connecting two pairs separately, while $\eta \equiv \kappa_4$ is the usual
notation to indicate the connected 4-point function, which accounts for the amount
of correlation due to the simultaneous presence of the four points.

In general, correlations of order $n$ are related to the $n$-point joint probability,

$$ \langle\rho(\mathbf{x}_1)\dots\rho(\mathbf{x}_n)\rangle = \bar\rho^{\,n}\,[1 + (\text{terms of order} < n) + \mu_n(\mathbf{x}_1,\dots,\mathbf{x}_n)] , \qquad (11) $$

in such a way that they give a null contribution when any subset of $\{\mathbf{x}_1,\dots,\mathbf{x}_n\}$ is
removed to infinity. In turn, an important theorem of combinatorial analysis shows
that, after removing from the $\mu_n$ function all the disconnected contributions, the remaining
connected part is just the $\kappa_n$ function defined by eq.(5). The general proof of this
theorem is rather tricky and will not be reported here. It is however not difficult to
see that, expressing the derivatives of the $\mathcal{K}[t]$ functional in terms of those of
$\mathcal{M}[t]$, we get at the first correlation orders

$$ \mu_2 = \kappa_2 , \quad \mu_3 = \kappa_3 , \quad \mu_4 = 3\kappa_2^2 + \kappa_4 , \quad \mu_5 = 10\kappa_2\kappa_3 + \kappa_5 , $$
$$ \mu_6 = 15\kappa_2^3 + 10\kappa_3^2 + 15\kappa_2\kappa_4 + \kappa_6 . \qquad (12) $$
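The numerical coefficients in eq.(12) can be checked in the simplest setting, a single zero-mean variable (all points coincident), where the moments follow from the cumulants through the standard recursion $\mu_n = \sum_{k=1}^{n} \binom{n-1}{k-1} \kappa_k \mu_{n-k}$. The sketch below is only an illustration under that assumption; the function name and the cumulant values are arbitrary and not taken from the text.

```python
from math import comb

def central_moments_from_cumulants(kappa):
    """Return [mu_0, ..., mu_n] given kappa = [kappa_1, ..., kappa_n].

    Uses the recursion
        mu_n = sum_{k=1}^{n} C(n-1, k-1) * kappa_k * mu_{n-k},
    with mu_0 = 1. Setting kappa_1 = 0 gives the moments of a
    zero-mean variable.
    """
    mu = [1.0]
    for n in range(1, len(kappa) + 1):
        mu.append(sum(comb(n - 1, k - 1) * kappa[k - 1] * mu[n - k]
                      for k in range(1, n + 1)))
    return mu

# Arbitrary illustrative cumulants (kappa_1 = 0 because <delta> = 0).
k1, k2, k3, k4, k5, k6 = 0.0, 2.0, 3.0, 5.0, 7.0, 11.0
mu = central_moments_from_cumulants([k1, k2, k3, k4, k5, k6])

# Compare with the closed forms quoted in eq.(12).
print(mu[4], 3 * k2**2 + k4)
print(mu[5], 10 * k2 * k3 + k5)
print(mu[6], 15 * k2**3 + 10 * k3**2 + 15 * k2 * k4 + k6)
```

Each printed pair should agree, which confirms the combinatorial factors 3, 10 and 15 for this one-variable case.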

From eq.(11), it follows that the $n$-point functions measure by how much the
distribution differs from a completely random (Poissonian) process. In fact, for a
Poissonian distribution the probability of some events in any subset of $\{\mathbf{x}_1,\dots,\mathbf{x}_n\}$
does not affect the probability at the other points. Accordingly, $\langle\rho(\mathbf{x}_1)\dots\rho(\mathbf{x}_n)\rangle = \bar\rho^{\,n}$
and correlations of any order vanish.

2.2. Correlations of a Gaussian field 

A particularly interesting and simple case is that in which the density fluctuations
are approximated by a random Gaussian process. The important role of Gaussian
perturbations in the cosmological context lies in the fact that, according to the classical
inflationary scenario, they are expected to originate from quantum fluctuations
of a scalar field at the outset of the inflationary expansion (see, e.g., ref.[57] and
references therein). Even without resorting to inflation, the Central Limit Theorem
guarantees that Gaussian statistics is the consequence of a large variety of random
processes, which makes it a sort of natural choice.

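The Gaussian probability distribution in the functional space $\mathcal{T}$ takes the form

$$ \mathcal{P}[\delta(\mathbf{x})] = (\det C)^{-1/2} \exp\left\{ -\frac{1}{2} \int d\mathbf{x} \int d\mathbf{x}'\,\delta(\mathbf{x})\,C^{-1}(\mathbf{x},\mathbf{x}')\,\delta(\mathbf{x}') \right\} . \qquad (13) $$

Here $C(\mathbf{x},\mathbf{x}')$ is called the correlation operator, which must be invertible and
symmetric with respect to the variables $\mathbf{x}$, $\mathbf{x}'$. From eq.(13), it follows that
this operator determines the variance of the distribution and, more generally, the
correlation properties of the fluctuation field. The above expression of the probability
distribution is such as to satisfy the normalization requirement. The corresponding
partition functional $\mathcal{M}[t]$ is

$$ \mathcal{M}[t] = (\det C)^{-1/2} \int \mathcal{D}[\delta(\mathbf{x})] \exp\left\{ -\frac{1}{2} \int d\mathbf{x} \int d\mathbf{x}'\,\delta(\mathbf{x})\,C^{-1}(\mathbf{x},\mathbf{x}')\,\delta(\mathbf{x}') + \int d\mathbf{x}\,\delta(\mathbf{x})\,t(\mathbf{x}) \right\} $$
$$ \phantom{\mathcal{M}[t]} = \exp\left\{ \frac{1}{2} \int d\mathbf{x} \int d\mathbf{x}'\,t(\mathbf{x})\,C(\mathbf{x},\mathbf{x}')\,t(\mathbf{x}') \right\} . \qquad (14) $$

According to the definition (4) of $\mathcal{K}[t]$, the generator of the connected correlation
functions reads

$$ \mathcal{K}[t] = \frac{1}{2} \int d\mathbf{x} \int d\mathbf{x}'\,t(\mathbf{x})\,C(\mathbf{x},\mathbf{x}')\,t(\mathbf{x}') , \qquad (15) $$

so that the corresponding connected correlation functions are

$$ \kappa_2(\mathbf{x}_1,\mathbf{x}_2) = C(\mathbf{x}_1,\mathbf{x}_2) , $$
$$ \kappa_n(\mathbf{x}_1,\dots,\mathbf{x}_n) = 0 \quad \text{if } n > 2 . \qquad (16) $$

Therefore, the fundamental property of a Gaussian density field is that its statistics
is completely determined by 2-point correlations.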

Although Gaussian density fluctuations are the natural outcome of simple
inflationary schemes, the observed distribution of cosmic structures
displays a clear non-Gaussian behaviour, as the detection of non-vanishing higher-order
correlations shows (see below). However, even starting with an initial Gaussian
density field, there are at least two valid reasons to expect the development of
subsequent non-Gaussian statistics for the galaxy distribution. Firstly, note that
Gaussian statistics assigns a non-vanishing probability even to the unphysical values
$\delta(\mathbf{x}) < -1$. However, as long as the variance of $\delta$ is much less than unity, the Gaussian
distribution is a good approximation, since a negligible probability is assigned to
$\delta < -1$. On the other hand, as soon as $\delta$ grows by gravitational instability, it is allowed
to take arbitrarily large and positive values while the $\delta < -1$ region remains forbidden,
thus forcing $\mathcal{P}[\delta]$ to become more and more skewed. Secondly, non-Gaussian statistics
is also expected in the framework of "biassed" models of galaxy formation [39, 3], in
which the observed cosmic structures are identified with those peaks of the underlying
Gaussian matter field that exceed a critical density value. In this case, analytical
arguments [63, 36] show that the non-Gaussian behaviour arises as a threshold effect
superimposed on a Gaussian background.
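The first argument can be made quantitative with a one-point estimate: for a zero-mean Gaussian variable with r.m.s. value $\sigma$, the probability assigned to the unphysical region is $P(\delta < -1) = \frac{1}{2}\,\mathrm{erfc}[1/(\sigma\sqrt{2})]$. The short sketch below evaluates it for a few illustrative values of $\sigma$; these values are assumptions chosen for the example, not numbers from the text.

```python
from math import erfc, sqrt

def prob_unphysical(sigma):
    """P(delta < -1) for a zero-mean Gaussian with r.m.s. sigma."""
    return 0.5 * erfc(1.0 / (sigma * sqrt(2.0)))

for sigma in (0.1, 0.3, 0.5, 1.0):
    print(f"sigma = {sigma:3.1f}  ->  P(delta < -1) = {prob_unphysical(sigma):.3e}")
```

The probability is utterly negligible for $\sigma \ll 1$ and rises to roughly 16 per cent at $\sigma = 1$, where the Gaussian description can no longer hold.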

2.3. The count-in-cell statistics

In order to pass from the study of a continuous density field to that of a single variable,
let us suppose that the density field $\rho(\mathbf{x})$ is sampled with volume elements of size $R$, whose
shape is described by the window function $W_R(\mathbf{x})$. The resulting observable quantity
is the variable

$$ \rho_R = \int d^3x\,\rho(\mathbf{x})\,W_R(\mathbf{x}) . \qquad (17) $$

The function $W_R(\mathbf{x})$ is normalized so that $\int d^3x\,W_R(\mathbf{x}) = 1$, and $\rho_R$ is the local average
of the density within the sampling volume. Commonly adopted choices for $W_R$ are
the top-hat window,

$$ W_R(\mathbf{x}) = \frac{3}{4\pi R^3}\,\Theta(R - |\mathbf{x}|) , \qquad (18) $$

with $\Theta$ the Heaviside step function, and the Gaussian window

$$ W_R(\mathbf{x}) = (2\pi R^2)^{-3/2}\,e^{-|\mathbf{x}|^2/2R^2} . \qquad (19) $$
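As a quick sanity check of the normalization condition, the following sketch integrates both windows over all space using spherical shells. The top-hat form used here, $3/(4\pi R^3)$ inside radius $R$ and zero outside, is the usual normalized step function and is an assumption of this example, as are the function names and the integration settings.

```python
from math import exp, pi

def top_hat(r, R):
    """Normalized top-hat window: constant inside radius R, zero outside."""
    return 3.0 / (4.0 * pi * R**3) if r <= R else 0.0

def gaussian(r, R):
    """Normalized Gaussian window of width R."""
    return (2.0 * pi * R**2) ** -1.5 * exp(-r**2 / (2.0 * R**2))

def volume_integral(window, R, r_max, steps=20000):
    """Midpoint-rule estimate of int 4 pi r^2 W(r) dr over [0, r_max]."""
    dr = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        total += 4.0 * pi * r**2 * window(r, R) * dr
    return total

R = 2.0
print(volume_integral(top_hat, R, r_max=R))          # close to 1
print(volume_integral(gaussian, R, r_max=10.0 * R))  # close to 1
```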

In analogy with the correlation function description of the continuous field $\rho(\mathbf{x})$, the
statistics of $\rho_R$ is completely specified by its probability density function (PDF) $p(\rho_R)$.
The corresponding moment of order $n$ is

$$ m_n(R) = \int d\rho_R\,p(\rho_R) \left( \frac{\rho_R}{\bar\rho} \right)^n . \qquad (20) $$

The moment generating function (MGF) is defined as

$$ M(t) = \langle \exp(t\rho_R/\bar\rho) \rangle = \int d\rho_R\,p(\rho_R)\,e^{t\rho_R/\bar\rho} \qquad (21) $$

and is the analogue of the $\mathcal{M}[t]$ functional of the continuous case. In turn, the MGF
can be expanded in a Maclaurin series,

$$ M(t) = \sum_{n=0}^{\infty} \frac{m_n(R)}{n!}\,t^n ; \qquad m_n(R) = \left.\frac{d^n M(t)}{dt^n}\right|_{t=0} , \qquad (22) $$

where the moments $m_n(R)$ are related to the correlation functions $\mu_n$ through the
integral relation

$$ m_n(R) = \int d^3x_1 \dots \int d^3x_n\,W_R(\mathbf{x}_1) \dots W_R(\mathbf{x}_n) \times [1 + (\text{terms of order} < n) + \mu_n(\mathbf{x}_1,\dots,\mathbf{x}_n)] . \qquad (23) $$

In a similar fashion, the cumulants or irreducible moments $k_n(R)$ are defined through
the generating function

$$ K(t) = \ln M(t) = \sum_{n=0}^{\infty} \frac{k_n}{n!}\,t^n ; \qquad k_n = \left.\frac{d^n K(t)}{dt^n}\right|_{t=0} , \qquad (24) $$

which is analogous to the $\mathcal{K}[t]$ generator of connected correlations. In fact, the
cumulants turn out to be related to the connected functions according to

$$ k_n(R) = \int d^3x_1 \dots \int d^3x_n\,W_R(\mathbf{x}_1) \dots W_R(\mathbf{x}_n)\,\kappa_n(\mathbf{x}_1,\dots,\mathbf{x}_n) , \qquad (25) $$

which is the average value of the irreducible $n$-point correlation function. Accordingly,
$k_2 \equiv \bar\xi$ is the variance of the distribution, while $k_3$ is the skewness (see ref.[15] for
the relevance of skewness in the cosmological context) and $k_4$ is the kurtosis. Suitable
relations between $k_n$ and $m_n$ can be found by successively differentiating eq.(24); they
resemble the analogous relations between connected and disconnected correlation
functions (see eq.[12]).

From eq.(21), the PDF is expressed as the inverse transform of the MGF as

$$ p(\rho_R) = \frac{1}{2\pi i\,\bar\rho} \int_{-i\infty}^{+i\infty} dt\,M(t)\,e^{-t\rho_R/\bar\rho} . \qquad (26) $$

It is worth noting that, for some models of the PDF, although the moments are
well defined according to eq.(20), the series in eq.(22) or, equivalently, the integral
in eq.(21), does not converge. For instance, this is the case for the lognormal PDF (see
Section 4.2, below; see also refs.[16, 17] for a more detailed discussion on this point).
In fact, for this model the divergence of the integral in eq.(21) is due to the long high-density
tail of the corresponding PDF shape. What happens in cases like this is that
the moments of integer positive order do not exhaust the whole one-point statistical
information which is contained in the PDF. Obviously, this does not imply that
for such PDFs the moments $m_n$ are of little relevance. On the contrary, they remain
useful instruments to compare data and simulations in order to assess the reliability
of cosmological models.
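The lognormal case can be made explicit. If $\ln(\rho_R/\bar\rho)$ is Gaussian with variance $\sigma^2$ and the mean is fixed so that $\langle\rho_R/\bar\rho\rangle = 1$, then $m_n = \exp[n(n-1)\sigma^2/2]$, which is finite at every order but grows faster than $n!$. The sketch below, with illustrative values of $t$ and $\sigma^2$ chosen for this example only, works with the logarithm of the series terms $m_n t^n/n!$ to avoid numerical overflow.

```python
from math import lgamma, log

def log_series_term(n, t, sigma2):
    """ln of m_n t^n / n! for a unit-mean lognormal PDF.

    For ln(rho_R / rho_bar) Gaussian with variance sigma2 and mean
    -sigma2 / 2, the moments are m_n = exp[n (n - 1) sigma2 / 2].
    """
    return 0.5 * n * (n - 1) * sigma2 + n * log(t) - lgamma(n + 1)

t, sigma2 = 0.1, 1.0
terms = [log_series_term(n, t, sigma2) for n in range(1, 201)]

for n in (1, 5, 10, 50, 200):
    print(f"n = {n:3d}   ln(term) = {terms[n - 1]:.1f}")
```

The terms first decrease and then grow without bound for any $t > 0$, since the $n^2$ growth of $\ln m_n$ eventually dominates $\ln n!$; every moment exists, yet the series has zero radius of convergence.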

Rather than continuous distributions, the analysis of galaxy
catalogues and of N-body simulations deals with discrete point distributions.
Therefore, a suitable prescription is required in order to relate the statistics of the
underlying density field to that of its discrete realization. A usual assumption is
that the point distribution one deals with represents a Poissonian realization of an
underlying continuous field. Let $\bar N$ be the average number of objects within the
sampling volumes. In a random realization of a given value of $\rho_R$, the actual number
of points must obviously be an integer. Its (non-integer) expectation value over
all the random realizations of $\rho_R$ is $(\bar N/\bar\rho)\,\rho_R$, fluctuations around this value being
described by Poissonian statistics. The PDF for a Poisson process $\varphi$ with mean $\bar\varphi$
is $p_P(\varphi) = \sum_{N=0}^{\infty} (\bar\varphi^N/N!)\,e^{-\bar\varphi}\,\delta_D(\varphi - N)$. Therefore, the PDF for a process $x = \varphi\bar\rho/\bar N$ is
$p_P(x) = p_P(\varphi)\,\bar N/\bar\rho$. Accordingly, the MGF reads

$$ M_P(t) = \int dx\,p_P(x)\,e^{tx/\bar\rho} = \exp\left[ \frac{\bar N \rho_R}{\bar\rho} \left( e^{t/\bar N} - 1 \right) \right] . \qquad (27) $$

This procedure, which concerns a particular $\rho_R$, is to be averaged over all the possible
realizations of the $\rho_R$ process. In this way we obtain the MGF for the discrete counts,
which reads

$$ M_{\rm disc}(t) = \int d\rho_R\,p(\rho_R) \exp\left[ \frac{\bar N \rho_R}{\bar\rho} \left( e^{t/\bar N} - 1 \right) \right] = M\left[ \bar N \left( e^{t/\bar N} - 1 \right) \right] . \qquad (28) $$

The discrete nature of the point distribution is therefore accounted for by the change
of variable $t \to \bar N (e^{t/\bar N} - 1)$ in the functional dependence of $M(t)$, which leaves the
variable unchanged in the limit $\bar N \to \infty$.

As for the PDF, in the discrete case eq.(26) gives

$$ p(\rho_R) = \frac{1}{2\pi\bar\rho} \int_{-\infty}^{+\infty} dt\,M\left[ \bar N \left( e^{it/\bar N} - 1 \right) \right] e^{-it\rho_R/\bar\rho} . \qquad (29) $$

Since the variable $e^{it/\bar N}$ takes values only on the unit circle of the complex plane, the
MGF turns out to be a periodic function. Therefore, its Fourier transform can be
written as a sum of Dirac $\delta$-functions:

$$ p(\rho_R) = \frac{\bar N}{\bar\rho} \sum_{N=-\infty}^{+\infty} P_N(R)\,\delta_D\left( \frac{\bar N \rho_R}{\bar\rho} - N \right) . \qquad (30) $$

Accordingly, the PDF vanishes except for the discrete set of values $\rho_N/\bar\rho = N/\bar N$, as it must
for a point distribution. In the above expression, the coefficients $P_N(R)$ are

$$ P_N(R) = \frac{1}{2\pi i} \oint dy\,y^{-(N+1)}\,M[\bar N (y - 1)] . \qquad (31) $$

For analytic $M(t)$ all the $P_N$'s with $N < 0$ vanish, so that they acquire the meaning of
probabilities of finding $N$ points inside a volume of size $R$. For $N \to \infty$ and $\bar N \to \infty$,
with fixed $N/\bar N$, eq.(30) gives back the continuous limit $P_N(R) = (\bar\rho/\bar N)\,p(\rho_N)$, with
the effective density variable given by $\rho_R/\bar\rho = N/\bar N$.

The statistics of the point distribution can be described in terms of the central
moments $\mu_n = \langle (N - \bar N)^n \rangle / \bar N^n$, where the moments of counts

$$ \frac{\langle N^n \rangle}{\bar N^n} = \frac{1}{\bar N^n} \sum_{N=1}^{\infty} N^n P_N(R) = \left.\frac{d^n M_{\rm disc}(t)}{dt^n}\right|_{t=0} \qquad (32) $$

are the coefficients of the Maclaurin expansion of the discrete MGF. According to the
above relations and following the definition (25) of cumulants, it is possible to express
the $k_n$, which characterize the underlying continuous field, in terms of the measured
moments of discrete counts. At the lowest orders, it is

$$ \mu_2 = \frac{1}{\bar N} + k_2 , $$
$$ \mu_3 = \frac{3\mu_2}{\bar N} - \frac{2}{\bar N^2} + k_3 , $$
$$ \mu_4 = 3\mu_2^2 + \frac{6\mu_3}{\bar N} - \frac{11\mu_2}{\bar N^2} + \frac{6}{\bar N^3} + k_4 \qquad (33) $$

(see, e.g., ref.[7]), while more cumbersome relations hold at higher orders. As expected,
all the shot-noise corrections vanish for large $\bar N$ values, while they dominate the signal
when the sampling rate is very low ($\bar N \ll 1$). In this case, recovering the continuous
statistics becomes a rather noisy procedure. It can also be shown that, although the
relations (33) have been obtained on the basis of the relation (28) between discrete
and continuous MGFs, their validity is not conditional on the existence of $M(t)$ (see,
e.g., ref.[58] for a derivation of shot-noise corrections under general conditions).
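The lowest-order shot-noise relations can be inverted to recover $k_2$, $k_3$ and $k_4$ from measured counts. The sketch below assumes the Poisson-sampling model of eq.(28), from which one obtains $k_2 = \mu_2 - 1/\bar N$, $k_3 = \mu_3 - 3\mu_2/\bar N + 2/\bar N^2$ and $k_4 = \mu_4 - 3\mu_2^2 - 6\mu_3/\bar N + 11\mu_2/\bar N^2 - 6/\bar N^3$; the function names are hypothetical. For a purely Poissonian distribution all three corrected cumulants must vanish, which is used as a check.

```python
def central_moments(counts):
    """Return (nbar, mu2, mu3, mu4), with mu_n = <(N - nbar)^n> / nbar^n."""
    m = len(counts)
    nbar = sum(counts) / m
    mu = [sum((c - nbar) ** n for c in counts) / (m * nbar ** n)
          for n in (2, 3, 4)]
    return (nbar, *mu)

def shot_noise_corrected(nbar, mu2, mu3, mu4):
    """Invert the Poisson-sampling relations for k_2, k_3, k_4."""
    k2 = mu2 - 1.0 / nbar
    k3 = mu3 - 3.0 * mu2 / nbar + 2.0 / nbar**2
    k4 = (mu4 - 3.0 * mu2**2 - 6.0 * mu3 / nbar
          + 11.0 * mu2 / nbar**2 - 6.0 / nbar**3)
    return k2, k3, k4

# Exact central moments of a Poisson distribution with mean nbar:
# <dN^2> = nbar, <dN^3> = nbar, <dN^4> = nbar + 3 nbar^2.
nbar = 4.0
poisson = (1.0 / nbar, 1.0 / nbar**2, 3.0 / nbar**2 + 1.0 / nbar**3)
print(shot_noise_corrected(nbar, *poisson))  # all three vanish
```

In practice `central_moments` would be applied to the list of counts in the sampling cells and its output passed to `shot_noise_corrected`.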

A more serious reason for concern in the application of eqs.(33) comes from the
fact that they are based on the assumption that the point distribution represents a
random sampling of an underlying continuous field. However, if observable objects
trace the high-density peaks, as expected for galaxies and clusters of galaxies, they
are far from being a Poissonian sampling of the dark matter distribution. Therefore,
such corrections are not expected to recover the statistics of the background density
field. Furthermore, for generic non-linear relations between the density field and the object
distribution, it is not guaranteed a priori that one can self-consistently define a
continuous field of which the observed galaxy distribution is a Poissonian
sampling. For these reasons, it is not necessarily advisable to apply shot-noise
corrections when comparing moments obtained from real data and numerical
simulations, provided that care is taken to reproduce in the artificial data set the same galaxy
number density as in the real one.

2.4. The hierarchical model

A rather popular model for connected correlations is represented by the hierarchical
ansatz

$$ \kappa_n(\mathbf{x}_1,\dots,\mathbf{x}_n) = \sum_{n\text{-trees}\ a}^{t_n} Q_{n,a} \sum_{\text{labelings}} \prod_{\text{edges}}^{(n-1)} \kappa_{2,ij} , \qquad (34) $$

which expresses the $n$-point connected function in terms of products of $(n-1)$ 2-point
functions [25]. In eq.(34), distinct "trees" designated by $a$ have in general different
coefficients $Q_{n,a}$, while the complete sequence of these coefficients uniquely specifies the
hierarchical model. Configurations that differ only in the interchange of labels $1,\dots,n$ all
have the same amplitude coefficients, and $ij$ is a single index which identifies links.
The number of trees $t_n$ with $n$ vertices is fixed by a theorem of combinatorial analysis,
while the total number of labeled trees is $T_n = n^{n-2}$. Thus, eq.(34) has $t_n$ amplitude
coefficients ($a = 1,\dots,t_n$) and $T_n$ total terms. For instance, for $n = 3$ it is $t_3 = 1$ and
$T_3 = 3$, so that

$$ \zeta_{123} = Q\,[\xi_{12}\xi_{13} + \xi_{12}\xi_{23} + \xi_{13}\xi_{23}] . \qquad (35) $$

For $n = 4$ it is $t_4 = 2$ and $T_4 = 16$. The resulting structure of the 4-point function is

$$ \eta_{1234} = Q_{4,1}\,[\xi_{12}\xi_{13}\xi_{14} + \dots\ (4\ \text{terms})] + Q_{4,2}\,[\xi_{12}\xi_{23}\xi_{34} + \dots\ (12\ \text{terms})] . \qquad (36) $$
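The tree counting behind the 3- and 4-point expressions can be verified by brute force. The sketch below enumerates labeled trees through their Pruefer sequences, a standard bijection that is not part of the text, and groups them by sorted degree sequence. For $n \le 5$ the degree sequence is enough to tell the tree topologies apart; for larger $n$ it is not, so this is only an illustration for small orders.

```python
from collections import Counter
from itertools import product

def tree_shapes(n):
    """Count labeled trees on n vertices, grouped by sorted degree sequence.

    Every sequence of length n - 2 over the labels 1..n (a Pruefer
    sequence) encodes exactly one labeled tree, and a vertex appearing
    m times in the sequence has degree m + 1.
    """
    shapes = Counter()
    for seq in product(range(1, n + 1), repeat=n - 2):
        occurrences = Counter(seq)
        degrees = tuple(sorted(occurrences[v] + 1 for v in range(1, n + 1)))
        shapes[degrees] += 1
    return shapes

shapes3 = tree_shapes(3)   # one topology, three labelings
shapes4 = tree_shapes(4)   # "star" (1,1,1,3) and "chain" (1,1,2,2)
print(dict(shapes3))
print(dict(shapes4))
```

For $n = 4$ the star topology, with one vertex linked to the other three, accounts for 4 labelings and the chain for 12, giving $T_4 = 16$ in total.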

By inserting the expression of eq.(34) for the connected correlations into the
definition (25) of cumulants, they can be written as

$$ k_n = S_n\,\bar\xi^{\,n-1} , \qquad (37) $$

with $\bar\xi \equiv k_2$ the variance. The reduced cumulants $S_n$ are given by suitable combinations
of the $Q_{n,a}$ coefficients, and their value also depends on the window function profile.
Accordingly, the cumulant generating function becomes

$$ K(t) = \bar\xi^{-1} \sum_{n=1}^{\infty} \frac{S_n}{n!}\,(\bar\xi t)^n . \qquad (38) $$

The relevance of the hierarchical scaling lies in the fact that it is supported by
observations (see below) and that it finds dynamical justification in the framework
of the gravitational instability picture.

In the mildly non-linear regime of gravitational clustering, hierarchical scaling is
predicted by perturbative approaches. Peebles [58] showed that, applying second-order
perturbation theory to the unsmoothed density field, one obtains $S_3 =
34/7$. Fry [24] demonstrated that hierarchical correlations of any order follow from
perturbative analysis. Juszkiewicz et al. [37] found that $S_3$ depends on the shape of the
spectrum and, for a top-hat window, it is

$$ S_3 = \frac{34}{7} - (n + 3) , \qquad (39) $$

where $n$ is the spectral index, $P(k) \propto k^n$. Bernardeau [4] developed a general
formalism to work out the expression of $S_N$ at the generic order $N$, still for a top-hat
window. Catelan & Moscardini [13] and Lokas et al. [47] provided expressions for $S_3$
and $S_4$ in the case of a Gaussian window.
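The top-hat prediction $S_3 = 34/7 - (n + 3)$ is easily tabulated and inverted, which is how a measured skewness translates into an effective spectral index. The sketch below is a direct transcription of that formula; the function names are hypothetical.

```python
def s3_top_hat(n):
    """Second-order perturbative S_3 for a top-hat window, P(k) ~ k^n."""
    return 34.0 / 7.0 - (n + 3.0)

def effective_index(s3):
    """Invert s3_top_hat to get the effective spectral index."""
    return 34.0 / 7.0 - 3.0 - s3

for n in (-2, -1, 0, 1):
    print(f"n = {n:+d}  ->  S3 = {s3_top_hat(n):.3f}")

print(f"S3 = 1.8  ->  n_eff = {effective_index(1.8):.3f}")
```

Steeper (more negative) spectra give a larger skewness, and a measured $S_3 \simeq 1.8$ corresponds to an effective index close to zero.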

In the strongly non-linear regime, the hierarchical scaling is predicted by the
closure of the BBGKY equations [23, 24, 35], although no general agreement exists
between different authors about the sequence of the $Q_n$ coefficients.



3. Error analysis 

One of the most important issues in any statistical analysis of the galaxy and cluster 
distributions is related to the estimate of the uncertainties that should be attached 
to the measured quantities. Several sources of errors are in general present, which 
are connected mainly to the limited number of objects included in any observational 
sample and to the finite size of the sampled portion of the Universe. 

Taking properly into account such uncertainties is of crucial relevance for at least 
two reasons: (a) to establish the statistical significance of any measured clustering 
signal; (b) to assess by how much this signal is different with respect to that provided 
by a reference model, such as a cosmological simulation of a given dark matter scenario. 



3.1. Sampling errors 

Sampling errors are due to the finite number of points in a given data set, which
produces a noisy sampling of the underlying statistics that one would like to measure. A
first prescription to estimate such errors relies on the assumption that the observed
object distribution is a Poissonian sampling of an underlying statistics. In this way,
the relative statistical uncertainty of a given measure follows the $1/\sqrt{N}$ rule, $N$ being
the number of "data" on which that measure is based. As an example, let us consider
the two-point correlation function, $\xi(r)$, which can be estimated as

$$ \xi(r) = \frac{DD(r)}{RR(r)} - 1 \qquad (40) $$

(cf. ref.[49], and references therein). In the above relation $DD(r)$ is the number of
pairs of data points at separation $r$, while $RR(r)$ is the number of pairs for a random
distribution having the same number of points as the real one. Accordingly, the
expected uncertainty in the $RR$ determination is $\sigma_{RR} = \sqrt{RR}$ and corresponds to the
scatter between different realizations of the random sample. Therefore, the Poissonian
error in the estimate of $\xi(r)$ is

$$ \sigma_\xi(r) = \sqrt{\frac{1 + \xi(r)}{RR(r)}} = \frac{1 + \xi(r)}{\sqrt{DD(r)}} , \qquad (41) $$

which has the expected scaling with $DD$. However, as also discussed for the shot-noise
corrections of eqs.(33), the assumption of Poissonian sampling of a background
statistics is always rather problematic.
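A minimal sketch of a pair-count estimate with its Poissonian error is given below. It assumes the simple estimator $\xi = DD/RR - 1$, with a random catalogue containing as many points as the data, and the Poisson error $\sigma_\xi = (1 + \xi)/\sqrt{DD}$; the brute-force pair counting and the function names are illustrative choices, suitable only for small samples.

```python
from math import dist, sqrt

def pair_counts(points, edges):
    """Histogram of pair separations; edges are increasing bin boundaries."""
    counts = [0] * (len(edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            r = dist(points[i], points[j])
            for b in range(len(counts)):
                if edges[b] <= r < edges[b + 1]:
                    counts[b] += 1
                    break
    return counts

def xi_with_poisson_error(dd, rr):
    """Return (xi, sigma_xi) from data-data and random-random pair counts."""
    xi = dd / rr - 1.0
    return xi, (1.0 + xi) / sqrt(dd)

# Four collinear points: three pairs at separation 1, two at separation 2.
line = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(pair_counts(line, [0.5, 1.5, 2.5]))

# Twice as many data pairs as random pairs in a bin.
print(xi_with_poisson_error(dd=400, rr=200))
```

With 400 data pairs against 200 random pairs the estimate is $\xi = 1$ with a Poissonian error of 0.1, i.e. a relative uncertainty set by $1/\sqrt{DD}$ up to the factor $(1 + \xi)/\xi$.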

In order to have more realistic estimates of the sampling errors, one possibility is to
find a suitable way to slightly perturb the point distribution and check the stability of
the statistical measure against the perturbation. A widely applied method to perturb
the original sample is based on the so-called bootstrap resampling technique (see, e.g.,
ref.[46]). This method is based on the generation of pseudo data sets, which are obtained
by randomly selecting $N$ data points, with repetition allowed, from the original data set
containing as many points. In more detail, suppose that $X = \{X_1,\dots,X_N\}$
is the set of $N$ raw data. A bootstrap sample $Y$ is then obtained by randomly
sampling the $X$ vector $N$ times. By repeating this operation $n$ times, one ends up
with an ensemble $Y_i$ ($i = 1,\dots,n$) of bootstrap samples. If $w_i^*$ is the result of
some measure on the $i$-th bootstrap sample (for example, the two-point correlation
function $\xi(r)$), then the variance over the $Y_i$ ensemble is

$$ \sigma_B^2 = \frac{\sum_{i=1}^{n} (w_i^* - w^*)^2}{n - 1} , \qquad (42) $$

with $w^* = \sum_{i=1}^{n} w_i^*/n$ the bootstrap-averaged quantity. Under general conditions, it
can be proved [40] that the variance evaluated over the ensemble of such bootstrap
resamplings converges to the true sampling error when the number of such resamplings
is sufficiently large, $\sigma_{\rm true}^2 = \lim_{n\to\infty} \sigma_B^2$. It is also possible to show [54] that in typical
cases the bootstrap errors are about a factor $\sqrt{3}$ larger than the Poisson error of
eq.(41).

It is worth recalling that the bootstrap resampling procedure gives only the 
sampling uncertainties in the estimate of the w quantity, whose value must not be 
confused with the bootstrap-averaged w* , which can be in general different since it 
refers to the perturbed data set. 
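The bootstrap procedure described above can be sketched in a few lines. The example below is a generic illustration, not the implementation used for any of the analyses quoted here: the statistic is a simple mean rather than a correlation function, the scatter is computed with an $n - 1$ normalization, and the function name, the seed and the number of resamplings are assumptions.

```python
import random
from statistics import mean, stdev

def bootstrap_error(data, statistic, n_resamplings=200, seed=0):
    """Return (w, sigma_B): the statistic on the original data and the
    r.m.s. scatter of the same statistic over bootstrap resamplings."""
    rng = random.Random(seed)
    size = len(data)
    values = [statistic([data[rng.randrange(size)] for _ in range(size)])
              for _ in range(n_resamplings)]
    return statistic(data), stdev(values)

data = list(range(100))
w, sigma_b = bootstrap_error(data, mean, n_resamplings=500)
print(f"w = {w:.2f} +/- {sigma_b:.2f}")
```

Note that the function returns the statistic measured on the original data, not the bootstrap average, in line with the warning above. For the mean of these 100 values the expected sampling error is about 2.9, which the bootstrap scatter should reproduce to within a few per cent.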



3.2. Cosmic variance 

This kind of uncertainty arises because of the intrinsically limited extension of the
volume surveyed by any sample. In fact, we expect that different portions of the
Universe differ from each other, the amount of scatter being rather large for small
patches and decreasing for larger and larger volumes, as the "fair sample" size is
approached.

On the other hand, typical sizes of available observational samples are by far not big
enough to allow an estimate of the large-scale cosmic variance over a sufficiently large
number of independent volumes. In order to overcome this limitation one resorts to
numerical simulations to estimate the amount of cosmic variance. It is however clear
that its value depends on the assumed cosmological model. In particular, a larger
scatter is expected if the assumed model power spectrum has larger fluctuations on
large scales.

Therefore, the best strategy to compare data and theoretical models by including
the effects of cosmic variance can be sketched as follows.

(a) Run for each model a large number of independent realizations, each reproducing 
the basic features (i.e., object number density and selection functions) of the real 
data set. 

(b) Check how many of such realizations reproduce the results of the real data 
analysis, so as to assess the reliability of that model. 
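Step (b) can be turned into a number in several ways. One possible criterion, which is an assumption of this sketch and not a prescription from the text, is the fraction of mock realizations whose measured statistic falls within the quoted error of the observed value. The mock values below are hypothetical and serve only to illustrate the bookkeeping.

```python
def fraction_consistent(mock_values, observed, observed_error, n_sigma=1.0):
    """Fraction of mock realizations within n_sigma errors of the data."""
    inside = [abs(v - observed) <= n_sigma * observed_error
              for v in mock_values]
    return sum(inside) / len(inside)

# Hypothetical variance measurements from ten realizations of one model.
mocks = [0.85, 1.1, 0.9, 1.6, 1.0, 0.7, 1.3, 1.15, 0.95, 2.1]
frac = fraction_consistent(mocks, observed=1.0, observed_error=0.2)
print(f"{frac:.0%} of the realizations reproduce the observed value")
```

A model for which only a small fraction of realizations reproduces the data would be disfavoured, with the caveat that the answer depends on the number of realizations available.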

The effect of the cosmic variance on the clustering analysis of galaxy clusters
can be judged from Figure 2, where the heavy crosses correspond to the bootstrap
uncertainties in the variance and skewness analysis (see below), while the clouds of
points are due to the cosmic variance effect for each of the six simulated dark matter
models. In this plot each open symbol corresponds to one realization of the Abell/ACO
observational sample for different dark matter models. It is interesting to note that
different realizations can provide results which differ from each other by a rather
large amount; at the smoothing scale $R_{\rm sm} = 30\,h^{-1}$ Mpc, the scatter is about half
an order of magnitude for the variance $\sigma^2$ and about one order of magnitude for the
skewness $\gamma$. This suggests that it can be quite dangerous to draw conclusions about
the reliability of a dark matter model on the basis of only a few realizations.

4. Results from observations 

4.1. Correlation analysis 

Analyses of the three- and four-point correlation functions for galaxy and cluster 
distributions have been pursued by several authors for about fifteen years, using 
both angular and redshift samples (see refs.[8, 69] for the relevant references). 

As for galaxies, results converge to indicate that the hierarchical model of eq.(35) 
is a good fit to data with $Q \simeq 1$. More recently, several attempts have been made to 
work out the reduced cumulants $S_n$ from different observational samples. The general 
outcome is that the hierarchical scaling is rather well reproduced. As an example, I 
plot in Figure 1 the results of the $S_3$ and $S_4$ analysis for a volume-limited subsample 
of the Perseus-Pisces Survey (PPS) [31], which contains all the galaxies with absolute 
magnitude $M < -19$. The analysis has been realized by computing the moments of 
counts 

$$\langle N^n \rangle = \frac{1}{M} \sum_{i=1}^{M} N_i^n , \qquad (43)$$

where $N_i$ is the number of objects within the $i$-th sphere ($i = 1,\dots,M$), which is 
randomly placed within the sample boundaries. In the present analysis, at each scale 
$R$ the total number of sampling spheres is $M = 2V_{\rm tot}/V(R)$, where $V_{\rm tot}$ is the total 
sample volume, $V(R) = (4\pi/3)R^3$ and the factor two should account for the presence 
of clustering in the galaxy distribution. The plotted errorbars are the r.m.s. bootstrap 
scatter estimated over 40 resamplings, which have been checked to be enough to ensure 
the convergence of the bootstrap procedure. 
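
The following sketch shows how cumulants and a bootstrap scatter could be obtained from a list of sphere counts. Two assumptions are mine and should be read as such: the discreteness corrections are the standard ones for Poisson sampling of the underlying field, and the bootstrap resamples the cell counts, whereas a full analysis may instead resample the objects and repeat the counting. All function names are hypothetical.

```python
import random


def corrected_cumulants(counts):
    """Shot-noise-corrected variance k2 and third cumulant k3.

    Assumes Poisson sampling, for which the central moments are
    mu_2 = Nbar + Nbar^2 k2 and mu_3 = Nbar + 3 Nbar^2 k2 + Nbar^3 k3.
    """
    m = len(counts)
    nbar = sum(counts) / m
    mu2 = sum((n - nbar) ** 2 for n in counts) / m
    mu3 = sum((n - nbar) ** 3 for n in counts) / m
    k2 = (mu2 - nbar) / nbar ** 2
    k3 = (mu3 - 3.0 * mu2 + 2.0 * nbar) / nbar ** 3
    return k2, k3


def bootstrap_rms(counts, statistic, n_resamplings=40, seed=0):
    """R.m.s. scatter of `statistic` over bootstrap resamplings of the cells."""
    rng = random.Random(seed)
    values = [statistic(rng.choices(counts, k=len(counts)))
              for _ in range(n_resamplings)]
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
```

With these definitions the reduced skewness would be `k3 / k2 ** 2`, and its error bar follows by passing the corresponding function as `statistic`.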

The dashed lines correspond to the best-fitting values of the reduced cumulants, 
$S_3 = 1.8$ and $S_4 = 4.8$. If one compares the above $S_3$ value with the prediction 
(39) of perturbation theory, it turns out that $n \simeq 0$ for the effective spectral index at 
the scales considered here ($\lesssim 10\,h^{-1}$ Mpc). This value seems quite large if compared to 
the predictions of current dark matter models, like CDM or CHDM, at the same scales. 
However, there are several reasons for caution when making this kind of comparison: 

(a) eq.(39) holds only in the dynamical regime where perturbation theory is expected 
to hold, that is $\sigma^2 = \bar\xi \lesssim 1$; 

(b) effects of redshift-space distortions can significantly affect the clustering pattern 
in observational samples; 

(c) eq.(39) refers to the whole matter density field, while galaxies are expected to be 
biased and point-like tracers of this continuous distribution. 
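
The comparison with perturbation theory can be checked with a few lines of arithmetic. The sketch assumes that eq.(39), which is not reproduced in this section, is the standard leading-order result for top-hat smoothing of a power-law spectrum, $S_3 = 34/7 - (n+3)$; if a different form is intended, the numbers change accordingly. Function names are hypothetical.

```python
def s3_perturbative(n):
    """Leading-order skewness for top-hat smoothing and spectral index n
    (assumed form of eq.(39))."""
    return 34.0 / 7.0 - (n + 3.0)


def effective_index(s3):
    """Invert the relation above for the effective spectral index."""
    return 34.0 / 7.0 - 3.0 - s3


print(round(effective_index(1.8), 2))  # prints 0.06
```

Under this assumption the measured $S_3 = 1.8$ corresponds to an effective index close to zero, with all the caveats (a)-(c) listed above.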




Figure 1. The variance-skewness (left panel) and the variance-kurtosis (right panel) 
relations for a volume-limited subsample of the Perseus-Pisces redshift survey. The 
horizontal axis of both panels is $\log k_2$. The dashed lines represent the best-fitting 
hierarchical predictions (see Table 1). 



As a matter of fact, the non-linear clustering developed by N-body simulations 
displays in general a much less accurate hierarchical scaling than observations (see, 
e.g., refs.[71, 48]). This behaviour has been interpreted in terms of sampling limitations 
[20], redshift-space distortions, which could make $S_n$ more constant in redshift space 
than in real space [45, 70, 33] (see however ref.[27]), and high-peak identification 
for galaxies [7]. 

For these reasons, the best way to compare model predictions to observations is to 
identify galaxies in N-body simulations in some realistic (physical) way and to extract 
mock samples that reproduce as closely as possible the observational biases (i.e., sample 
boundaries, selection functions, object number density, etc.; see also ref.[64]). In Table 
1 I compare the results for the PPS sample to those for mock samples extracted from 
high-resolution N-body simulations of a Cold+Hot Dark Matter (CHDM) model, 
which contains a 30% hot component contributed by one massive neutrino (see, 
e.g., ref.[42] for the cosmological relevance of this model). Also reported are the 
corresponding values taken from the literature for other samples. Apart from the 
remarkable agreement between the real and simulated PPS samples, we note that all the results 
converge to indicate that $S_3 \simeq 2$ and $S_4$ in the range 5-10 characterize the galaxy 
clustering. 

As for clusters, despite the lower significance of the signal with respect to galaxies, 
due to the sparseness of the distribution, there is evidence that their three-point 
correlation function agrees with the hierarchical model for Q ~ 0.6 (see, e.g., ref.[8] 
and references therein). As for the count-in-cell statistics, several authors found 
recently that S3 ~ 2 for both angular [9] and redshift [60, 29] samples. 

Table 1. Values of the $S_3$ and $S_4$ coefficients for simulations and for real galaxy 
samples. Only results for PPS, for IRAS by Bouchet et al. [11] and for CHDM 
simulations refer to redshift space. 

Sample       S_3          S_4
CHDM         1.8 ± 0.3    6.1 ± 2.2
PPS          1.8 ± 0.2    4.8 ± 1.5
CfA [58]     2.4 ± 0.2    -
CfA [27]     2.0 ± 0.2    6.3 ± 1.6
SSRS [27]    1.8 ± 0.2    5.4 ± 0.2
IRAS [52]    2.2 ± 0.2    10 ± 3
IRAS [27]    2.2 ± 0.3    9.2 ± 3.9
IRAS [11]    1.5 ± 0.5    4.4 ± 3.7
APM [28]     3.8 ± 0.4    33 ± 7

In Figure 2 I report the variance-skewness relation for Abell/ACO clusters and 
compare it with similar results from the analysis of an extended set of cluster 
simulations based on an optimized implementation of the Zel'dovich approximation 
[68], as described in ref.[59]. The moments for the cluster distribution have been 
estimated by using a Gaussian window and the two plotted data sets refer to $R = 20$ 
and $30\,h^{-1}$ Mpc for the window radius. The six panels are for different dark matter 
models and each point refers to one simulation of that model, containing about the 
same number of points as the real sample. As already mentioned, the scatter of such 
points gives an idea of the cosmic variance, i.e. of the variation of the results when 
taking different patches of the Universe. Apart from the details of the dark matter 
models (see ref.[10]), it is worth noting how discriminatory this statistic is: for several 
models almost no observer measures cumulants in the observational range, while for 
other models the real data are rather typical. 

4.2. The probability density function 

Instead of studying the moments of a distribution, an alternative method, which is 
becoming increasingly popular, is the study of the probability density function (PDF). 
Usually one attempts to obtain a continuous density field by smoothing the discrete 
distribution of objects with some window function like those of eqs.(18) and (19). 

Different expressions for models of the PDF have been introduced in the literature. 
The most common ones are listed below. 



(a) The Gaussian PDF 

$$p(\varrho) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(\varrho-1)^2}{2\sigma^2}\right] , \qquad (44)$$

where $\sigma^2$ is the variance of $\varrho = \rho/\bar\rho$. 

Figure 2. Variance-skewness relation for clusters, using the Gaussian window. 
The six panels are for different dark matter models. Two window radii are used: 
$R = 20\,h^{-1}$ Mpc (upper data) and $R = 30\,h^{-1}$ Mpc (lower data). Heavy crosses are 
the observational results based on the Abell-ACO redshift cluster sample. 



(b) The lognormal distribution given by 

$$p(\varrho) = \frac{1}{\sqrt{2\pi\sigma_L^2}} \exp\left[-\frac{(\ln\varrho-\mu_L)^2}{2\sigma_L^2}\right] \varrho^{-1} , \qquad (45)$$

where $\varrho$ is obtained through an exponential transformation of a Gaussian random 
variable $x$ as $\varrho = \exp(x)$. In eq.(45), $\mu_L$ and $\sigma_L$ are the mean and standard 
deviation of $x = \ln\varrho$, respectively. 



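(c) The PDF resulting from the application of the Zel'dovich approximation to 
Gaussian initial fluctuations [43], which is expressed through the functions 

$$\beta_n(s) = \sqrt{5}\, s \left\{ \frac{1}{2} + \cos\left[ \frac{2}{3}(n-1)\pi + \frac{1}{3}\arccos\left(\frac{54}{\varrho s^3} - 1\right) \right] \right\} . \qquad (46)$$

Its normalization involves $N_s$, the average stream number per Eulerian point (see ref.[43]). 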






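(d) The Edgeworth expansion [38] 

$$p(\varrho) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(\varrho-1)^2}{2\sigma^2}\right] \left\{ 1 + \sigma\, \frac{S_3 H_3(x)}{6} + \sigma^2 \left[ \frac{S_4 H_4(x)}{24} + \frac{S_3^2 H_6(x)}{72} \right] + \dots \right\} , \qquad (47)$$

where $x = \delta/\sigma$, with $\delta = \varrho - 1$, and $H_n$ are the Hermite polynomials. This expression represents 
an expansion around the Gaussian expression, which holds for small values of $\sigma$. Its 
reliability however breaks down at $\sigma > 0.5$, where $p(\varrho)$ becomes unphysically 
negative. 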
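
The analytic models (a), (b) and (d) are simple enough to be evaluated directly. The sketch below makes two assumptions that are not stated above: the lognormal parameters are fixed by requiring unit mean and variance $\sigma^2$ for $\varrho$ (so that $\sigma_L^2 = \ln(1+\sigma^2)$ and $\mu_L = -\sigma_L^2/2$), and the Hermite polynomials are the "probabilists'" ones, $H_3 = x^3 - 3x$, and so on. Function names are hypothetical.

```python
import math


def gaussian_pdf(rho, sigma):
    """Gaussian PDF of eq.(44) for rho = density / mean density."""
    return math.exp(-0.5 * ((rho - 1.0) / sigma) ** 2) / (
        math.sqrt(2.0 * math.pi) * sigma)


def lognormal_pdf(rho, sigma):
    """Lognormal PDF of eq.(45), normalized to unit mean (assumption)."""
    var_l = math.log(1.0 + sigma ** 2)
    mu_l = -0.5 * var_l
    return math.exp(-(math.log(rho) - mu_l) ** 2 / (2.0 * var_l)) / (
        rho * math.sqrt(2.0 * math.pi * var_l))


def edgeworth_pdf(rho, sigma, s3, s4):
    """Edgeworth expansion of eq.(47), truncated at order sigma^2."""
    x = (rho - 1.0) / sigma
    h3 = x ** 3 - 3.0 * x
    h4 = x ** 4 - 6.0 * x ** 2 + 3.0
    h6 = x ** 6 - 15.0 * x ** 4 + 45.0 * x ** 2 - 15.0
    correction = (1.0 + sigma * s3 * h3 / 6.0
                  + sigma ** 2 * (s4 * h4 / 24.0 + s3 ** 2 * h6 / 72.0))
    return gaussian_pdf(rho, sigma) * correction
```

Evaluating `edgeworth_pdf(2.6, 0.8, 3.0, 16.0)` gives a negative number, which illustrates the breakdown of the truncated expansion once $\sigma$ is no longer small.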




Figure 3. Comparison between the PDFs of simulated cluster distributions and the 
theoretical models, at $R_{\rm sm} = 20$ and $40\,h^{-1}$ Mpc for the Gaussian smoothing scale. 
Solid, long-dashed and short-dashed curves correspond to the lognormal, Zel'dovich 
and Gaussian model, respectively. Error bars are the cosmic r.m.s. scatter evaluated 
over 50 realizations of each model (taken from [10]). 



Kofman et al. [43] computed the PDF for the IRAS sample and for the density field 
reconstructed with the POTENT procedure with $\Omega = 1$. They found that the PDF 
is well modelled by the lognormal expression. By comparing these results with CDM 
N-body simulations, they also concluded that this result is perfectly consistent with 
the assumption of Gaussian initial conditions. Plionis & Valdarnini [60] and Kolatt et 
al. [44] studied the PDF of the Abell/ACO smoothed cluster distribution and found 
that also for clusters the PDF is well approximated by a lognormal distribution. In 
ref.[10] we compared the PDF for Abell/ACO clusters to that of cluster simulations 
based on six different dark matter models. We found that the shape of the PDF is 
a stringent test for such models, thus suggesting its use as a discriminatory 
statistic. 

Coles & Jones [16] argued that the lognormal distribution provides a natural 
description for density perturbations resulting from Gaussian initial conditions in the 
weakly non-linear regime. On the other hand, Bernardeau & Kofman [5] have shown 
that the lognormal distribution is not really a natural consequence of mildly non-linear 
gravitational evolution, but a very convenient fit only in some portion of the 
$(\bar\xi, n)$-plane (i.e. $\bar\xi \ll 1$ and spectral index $n \approx -1$). 

In Figure 3 I plot the PDF for cluster simulations. The plots refer to simulations 
based on two rather different initial spectra, namely the standard CDM model (top 
panels) and the CHDM model (bottom panels). It turns out that, despite the difference 
between the two initial spectra and the fact that the underlying dynamics is regulated 
by the Zel'dovich approximation, the lognormal expression fares much better than 
that of eq.(46). Note also that the lognormal model remains a better fit than the 
Gaussian one even at the larger smoothing scale, where the variance is well below 
unity ($\sigma^2 \simeq 0.06$ and $\sigma^2 \simeq 0.04$ for SCDM and CHDM, respectively; see ref.[10]). 

5. Geometrical descriptions of the LSS 

The variety of structures in the galaxy distribution, like filaments, voids, clusters, 
extending over a broad range of scales, calls for a global description of the geometry 
of the LSS. Although the correlation analysis provides rather useful information, 
it says only a little about the "shape" of the galaxy distribution. For this 
reason, many attempts have been made to develop and apply statistical methods 
able to provide such a description. A treatment of all such approaches is 
beyond the scope of these Lectures. They include the study of percolation properties 
[41], Minkowski functionals (see ref.[67] and references therein), structure of the 
minimal spanning tree [6] or other graph statistics, filamentarity analyses (see, e.g., 
ref.[21]), etc. In the following I will only describe two popular examples of such 
statistics, namely the void probability function (VPF) and the topology analysis of 
the genus characteristics. 

5.1. The void probability function 

The void probability function (VPF) is defined as the probability of finding no objects 
within randomly placed sampling volumes. According to its definition, it represents 
the $N = 0$ case of the $P_N$ count probabilities defined by eq.(31). Therefore, it is 
connected to the sequence of cumulants as 

$$P_0(R) = \exp\left[ \sum_{n=1}^{\infty} \frac{(-\bar N)^n}{n!}\, k_n(R) \right] \qquad (48)$$

($k_1(R) = 1$). Since $P_0$ conveys information about correlations of any order, the 
VPF statistic has been suggested as a useful tool to provide a global clustering 
characterization. Note, however, that $P_0$ depends only on the number of non-empty 
cells, with no regard to the number of objects contained inside them. For this reason, 
it provides only a description of the geometry, rather than of the clustering, of a point 
distribution. 

For a completely uncorrelated (i.e. Poissonian) distribution, it is $P_0 = \exp(-\bar N)$, 
so that any departure of the quantity 

$$\sigma(\bar N, R) = -\frac{\ln P_0(R)}{\bar N} \qquad (49)$$

from unity represents the signature of the presence of clustering. 
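
In practice $P_0$ is estimated as the fraction of empty sampling volumes. The sketch below, with hypothetical function names, takes a list of counts in randomly placed cells of fixed size and returns the quantity of eq.(49), assumed here to be $-\ln P_0/\bar N$.

```python
import math


def void_probability(counts):
    """Fraction of sampling cells that contain no objects."""
    return sum(1 for n in counts if n == 0) / len(counts)


def sigma_vpf(counts):
    """Normalized VPF, -ln(P_0) / Nbar; equal to 1 for a Poisson sample."""
    nbar = sum(counts) / len(counts)
    p0 = void_probability(counts)
    if p0 == 0.0 or nbar == 0.0:
        raise ValueError("need both empty and non-empty cells")
    return -math.log(p0) / nbar
```

For the toy counts `[0, 0, 2, 2]` one finds $P_0 = 0.5$ and $\bar N = 1$, hence $\sigma = \ln 2 \simeq 0.69 < 1$: the voids are more probable than in a Poisson sample with the same mean density.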

Assuming hierarchical scaling for correlations, and owing to the expression (37) 
for $k_n$, it follows that 

$$\sigma(N_c) = \sum_{n=1}^{\infty} \frac{(-N_c)^{n-1}}{n!}\, S_n , \qquad (50)$$

where $N_c = \bar N \bar\xi$ is the average object count in excess with respect to a random 
distribution. Therefore, while the value taken by $\sigma - 1$ states the deviation of the 
distribution from Poisson, the scale dependence of $\sigma$ in the hierarchical scaling regime 
can be expressed directly through $N_c$ (and not through $\bar N$ and $R$ separately). 
In the analysis of their scale-invariant model for correlation functions, 

$$k_n(\lambda \mathbf{x}_1,\dots,\lambda \mathbf{x}_n) = \lambda^{-\gamma(n-1)}\, k_n(\mathbf{x}_1,\dots,\mathbf{x}_n) \qquad (51)$$

(where $\xi(R) \propto R^{-\gamma}$), Balian & Schaeffer [2] found that for asymptotically large $N_c$ 
the power-law relation $\sigma(N_c) \propto N_c^{-\omega}$ should hold, with $0 < \omega < 1$. In the framework 
of the hierarchical correlation pattern, several models have been proposed, each providing 
a different expression for the VPF. Among these models is the thermodynamical one 
[66], which predicts 

$$\sigma(N_c) = (1 + N_c)^{-1/2} . \qquad (52)$$

A further model [26] describes the galaxy clustering as due to a Poissonian distribution 
of clusters, each containing a suitable number of members. The resulting hierarchical 
Poisson distribution gives 

$$\sigma(N_c) = \frac{1 - e^{-N_c}}{N_c} . \qquad (53)$$

The negative binomial model [12] predicts 

$$\sigma(N_c) = \frac{\ln(1 + N_c)}{N_c} \qquad (54)$$

and has been shown to provide a quite good fit to CfA data [30]. Finally, the 
phenomenological model 

$$\sigma(N_c) = \left(1 + \frac{N_c}{2\omega}\right)^{-\omega} \qquad (55)$$

has been proposed by Alimi et al. [1], who found a best fit to the CfA data for 
$\omega = 0.50 \pm 0.15$ (note that for $\omega = 0.5$ eq.(55) coincides with the thermodynamical 
model). A similar result has also been found by Maurogordato et al. [51] from the 
analysis of the SSRS survey and by Bonometto et al. [7] from the analysis of CDM 
and CHDM simulations. 
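
The hierarchical models above can be compared numerically. The sketch assumes the following forms for $\sigma(N_c)$: $(1+N_c)^{-1/2}$ for the thermodynamical model, $(1-e^{-N_c})/N_c$ for the hierarchical Poisson model, $\ln(1+N_c)/N_c$ for the negative binomial model and $(1+N_c/2\omega)^{-\omega}$ for the phenomenological one. It also sums a truncated version of the hierarchical series for an arbitrary list of $S_n$ coefficients. Function names are hypothetical.

```python
import math


def sigma_series(n_c, s_coeffs):
    """Truncated series sum_n (-N_c)^(n-1) S_n / n!, with s_coeffs[0] = S_1."""
    return sum((-n_c) ** (n - 1) * s_n / math.factorial(n)
               for n, s_n in enumerate(s_coeffs, start=1))


def sigma_thermodynamical(n_c):
    return (1.0 + n_c) ** -0.5


def sigma_hierarchical_poisson(n_c):
    # expm1 keeps the result accurate for small N_c.
    return -math.expm1(-n_c) / n_c


def sigma_negative_binomial(n_c):
    return math.log1p(n_c) / n_c


def sigma_phenomenological(n_c, omega):
    return (1.0 + n_c / (2.0 * omega)) ** -omega
```

As a consistency check, the negative binomial model corresponds to $S_n = (n-1)!$, so that `sigma_series` with these coefficients converges to `sigma_negative_binomial` for $N_c < 1$. All models tend to the Poisson value $\sigma = 1$ for $N_c \to 0$.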

The VPF has been suggested as a potentially powerful discriminant between 
different cosmological models. It is however clear that it is also very sensitive to 
the details of the object distribution. For instance, adding a few points in underdense 
regions is not expected to significantly affect correlation functions, while it may greatly 
modify the VPF. Indeed, Weinberg & Cole [73] found that the VPF is sensitive to the 
galaxy identification scheme in N-body simulations. In addition, selection effects in 
real samples, like boundary geometry, redshift-space distortions, etc., can make 
any comparison of VPF results for real and simulated universes rather difficult. 

In Figure 4 I report the VPF results of the comparison between CDM and 
CHDM simulations and a volume-limited subsample of the Perseus-Pisces redshift 
survey [32]. The four panels are for two different realizations of CHDM and for two 
different evolutionary stages of CDM. The dashed curves represent the shot-noise level, 
$P_0 = e^{-\bar N}$, the solid curve is for the real data set and the different dotted curves in each 
panel are for different mock samples extracted from each simulation box. Note that 
in this case different samples do not involve independent volumes. Instead, they are 
obtained by sampling almost the same simulation volume from different vantage 
points. Therefore, the scatter between the curves is the effect of a kind of "local 
variance", rather than of the cosmic variance, whose larger effect may be judged by 
comparing the results for the two CHDM realizations. 

It is remarkable that differences between different realizations are of the same 
order as differences between different dark matter models. Therefore, although such 
results confirm the VPF as a useful discriminatory statistic, they also suggest the 
necessity of having cosmic-variance effects under control, either by choosing a larger box 
or by running constrained simulations, which contain ab initio the essential features of a 
specific observational sample. 



5.2. Topology 

Instead of providing a detailed description of topological concepts in a formal 
mathematical language, in the following I only briefly introduce the measures of 
topology which are applied in cosmological context and what we learn from their 
application (see ref.[53] for a review about the cosmological applications of topology 
measures). In this context, the concept of "genus" has been introduced to describe the 
topology of isodensity surfaces, drawn from a density field. The genus G of a surface 
can be defined as 

$$G = (\text{number of holes}) - (\text{number of isolated regions}) + 1 . \qquad (56)$$

Therefore, a single sphere has genus $G = 0$, a distribution made of $N$ disjoint spheres 
has $G = -(N - 1)$, while $G = 1$ for a torus. More generally, the genus of a surface 
corresponds to the number of "handles" it has, or, equivalently, to the number of cuts 
that can be made on that surface without disconnecting it into separate parts. A 
more formal definition of genus can be given by means of the Gauss-Bonnet theorem 
(see, e.g., ref.[56]), which relates the curvature of the surface to the number of holes. 
According to this theorem, for any compact two-dimensional surface the genus $G$ is 
related to the curvature $C$ according to 

$$C = \int K\, dA = 4\pi (1 - G) . \qquad (57)$$

Here $K$ represents the local Gaussian curvature of the surface that, at each point, 
is defined as the reciprocal of the product of the two principal curvature radii, 
$K = (a_1 a_2)^{-1}$. Since $K$ has the dimension of length$^{-2}$, the curvature $C$ and, thus, 
the genus are dimensionless quantities. For a sphere of radius $r$ it is $K = r^{-2}$, so 
that $C = 4\pi$ and $G = 0$, as previously argued. Strictly speaking, while the genus of a 
surface gives the number of its "handles", eq.(56) defines a related quantity, that is the 
Euler-Poincaré (EP) characteristic [56]. In a sense, we can say that, while the genus 
deals with the properties of a surface, the EP characteristic describes the properties 
of the excursion set, i.e. of the part of the density field exceeding a density threshold 
value. Based on the Gauss-Bonnet theorem, it can be proved that genus and EP 
characteristic are completely equivalent in the three-dimensional case. 

Figure 4. The scale dependence of the VPF $P_0(R)$ is shown for the $M_{\rm lim} < -19$ 
volume-limited sample (VLS) of the Perseus-Pisces Survey (continuous curve) and for five 
different artificial VLSs obtained from each simulation (dotted curves). The five 
realizations of artificial VLSs have different observer positions but the same number 
of galaxies as in the real VLS. The dashed curve is what one expects for a Poissonian 
distribution (taken from [33]). 
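
The Gauss-Bonnet relation $C = 4\pi(1 - G)$ can be verified numerically for a torus, a case not worked out above. The sketch uses the standard parametrization of a torus with major radius $R$ and tube radius $r$, for which $K = \cos\theta / [r (R + r\cos\theta)]$ and $dA = r (R + r\cos\theta)\, d\theta\, d\phi$; the function names and the midpoint integration rule are my own choices.

```python
import math


def total_curvature_torus(big_r, small_r, steps=200):
    """Integrate the Gaussian curvature over a torus (midpoint rule)."""
    d_theta = 2.0 * math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        ring = big_r + small_r * math.cos(theta)
        curvature = math.cos(theta) / (small_r * ring)
        # Area element already integrated over the azimuthal angle phi.
        area = 2.0 * math.pi * small_r * ring * d_theta
        total += curvature * area
    return total


def genus_from_curvature(total_curvature):
    """Invert C = 4 pi (1 - G)."""
    return 1.0 - total_curvature / (4.0 * math.pi)
```

The positive curvature of the outer part of the torus cancels the negative curvature of the inner part, so that $C = 0$ and $G = 1$, while $C = 4\pi$ gives back $G = 0$ for the sphere.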

In order to quantify the genus of the observed large scale clustering, the first step 
is to extract a continuous density field starting from the discrete object distribution. 
This can be done by collecting the points in cells and then by smoothing the resulting 
cell count with a suitable window function. In order to keep Poissonian shot-noise 
from dominating the geometry of the smoothed field, the smoothing radius should be 
chosen not to be much smaller than the typical correlation length. 

In topology analysis it is useful to study the dependence of the genus of isodensity 
surfaces on the value of the density threshold. If a high density value is selected, 
only a few very dense and isolated regions will be above the threshold and the genus is 
negative. For a very low threshold, only a few isolated voids are identified and, again, 
the corresponding genus is negative. For thresholds around the median density value 
we expect in general that the isodensity surfaces have a multiply connected structure, 
with a resulting positive genus. These general considerations can be verified on a more 
quantitative ground for models having an analytical expression for the genus. The 
simplest case occurs for a Gaussian random field (see, e.g., ref.[3]), which, in three 
dimensions, has a threshold-dependent genus per unit volume 

$$g(\nu) = \frac{1}{4\pi^2} \left( \frac{\langle k^2 \rangle}{3} \right)^{3/2} (1 - \nu^2)\, e^{-\nu^2/2} . \qquad (58)$$

The density threshold is set so as to select only fluctuations exceeding $\nu$ times the 
r.m.s. value $\sigma$. Therefore, $g(\nu)$ describes the topology of the isodensity surfaces, 
where the fluctuations take the value $\delta = \nu\sigma$. Moreover, 

$$\langle k^2 \rangle = \frac{\int P(k)\, W^2(k)\, k^2\, d^3k}{\int P(k)\, W^2(k)\, d^3k} \qquad (59)$$

is the second order spectral moment, which depends on the choice of the window $W(k)$ 
used to smooth the discrete distribution. According to eq.(59), $g(\nu)$ depends on the 
shape of the power spectrum, but not on its normalization. Since the amplitude of 
the genus curve turns out to depend on the profile of the power spectrum through 
the second-order spectral moment, repeating genus measures for different smoothing 
radii gives information about the shape of $P(k)$. Following eq.(58), several interesting 
features of the $g(\nu)$ curve appear. First of all, as expected for a Gaussian field, which 
has the same structure in the overdense and underdense regions, $g(\nu)$ is an even 
function of $\nu$, with its maximum at $\nu = 0$. This is characteristic of the so-called 
"sponge-like" topology. For $|\nu| < 1$ it is $g(\nu) > 0$, due to the multiple connectivity 
of the isodensity surfaces, while $g(\nu) < 0$ for $|\nu| > 1$, due to the predominance of 
isolated clusters. Different topologies are however expected when non-Gaussian fields 
are considered [14]. 
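
The Gaussian genus curve is easy to tabulate. The sketch below assumes the standard form $g(\nu) = (4\pi^2)^{-1} (\langle k^2\rangle/3)^{3/2} (1-\nu^2) e^{-\nu^2/2}$ and, as an illustration of the dependence on the spectral shape, the closed form $\langle k^2\rangle = (n+3)/(2R^2)$, which holds for a power-law spectrum $P(k) \propto k^n$ smoothed with a Gaussian window $W(k) = \exp(-k^2R^2/2)$. Both the window choice and the function names are assumptions made for this example.

```python
import math


def k2_power_law(n, radius):
    """Second spectral moment for P(k) ~ k^n and a Gaussian window of
    radius `radius` (illustrative special case)."""
    return (n + 3.0) / (2.0 * radius ** 2)


def genus_density_gaussian(nu, k2_mean):
    """Genus per unit volume of a three-dimensional Gaussian field."""
    amplitude = (k2_mean / 3.0) ** 1.5 / (4.0 * math.pi ** 2)
    return amplitude * (1.0 - nu ** 2) * math.exp(-0.5 * nu ** 2)
```

The curve vanishes at $|\nu| = 1$ whatever the spectrum, while its amplitude scales as $[(n+3)/R^2]^{3/2}$ in this special case, so that steeper spectra (more negative $n$) give a lower genus amplitude at fixed smoothing radius.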

In the case of a distribution realized by superimposing dense clusters on a smooth 
background, isolated structures start dominating also at rather low density values and 
the $g(\nu)$ curve peaks at negative $\nu$'s. Vice versa, at large and positive $\nu$ values the 
distribution is that of isolated regions and $g(\nu)$ becomes more negative than expected 
for a Gaussian field. This case is usually referred to as "meatball" topology. The 
opposite case occurs when the distribution is dominated by big voids, with objects 
arranged in sheets surrounding the voids. The resulting topology is usually called 
"cellular" or "Swiss-cheese" and the corresponding $g(\nu)$ curve peaks at positive $\nu$'s. 

Topological measures can also be usefully employed when dealing with two-dimensional 
density fields. However, in this case some ambiguities arise, for example in 
distinguishing whether an underdense area is due to a tunnel or to a spherical void in 
three dimensions. In addition, the interpretation of the genus in terms of the number 
of handles of an isodensity surface cannot be applied in two dimensions. In this case, 
the topology measure is represented by the EP characteristic, which is defined as 
the difference between the number of isolated high-density regions and the number of 
isolated low-density regions. The EP characteristic per unit area at the overdensity 
level $\nu$ for a Gaussian random field is 

$$g(\nu) = \frac{1}{(2\pi)^{3/2}}\, \frac{\langle k^2 \rangle}{2}\, \nu\, e^{-\nu^2/2} , \qquad (60)$$

so that $g(\nu)$ is an odd function of $\nu$ and $g(0) = 0$. 
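
For comparison with the three-dimensional case, the two-dimensional curve can be coded in the same way. The sketch assumes the standard Gaussian result $g(\nu) = (2\pi)^{-3/2} (\langle k^2\rangle/2)\, \nu\, e^{-\nu^2/2}$, with $\langle k^2\rangle$ now computed for the two-dimensional field; the function name is hypothetical.

```python
import math


def ep_density_gaussian_2d(nu, k2_mean):
    """EP characteristic per unit area of a two-dimensional Gaussian field."""
    amplitude = 0.5 * k2_mean / (2.0 * math.pi) ** 1.5
    return amplitude * nu * math.exp(-0.5 * nu ** 2)
```

The curve is antisymmetric, crosses zero at the mean density and has its extrema at $\nu = \pm 1$, where isolated high-density (or low-density) regions dominate.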

The genus statistic has been applied in recent years to the study of the LSS, 
analysing both the evolution of N-body simulations and observational data sets. 
Measures of the EP characteristic for the angular galaxy distribution [19] and of the 
genus for three-dimensional redshift surveys [34, 55, 72] consistently show a too large 
genus amplitude if compared to N-body simulations of the standard CDM model. 
Furthermore, all the analyses indicate the presence of a slight meatball shift at small 
smoothing scales, followed by a sponge-like topology at larger scales, as expected on 
the ground of Gaussian initial fluctuations. Although the meatball shift at small scales 
is expected on the ground of non-linear gravitational evolution [50], attempts have 
also been made to check whether this result implies non-Gaussian initial conditions 
for the CDM model [18]. 

The same analysis has also been applied to galaxy clusters, both 
for their projected distribution [61] and for redshift surveys [65]. Also in this case a 
slight meatball shift is observed, which is however consistent with expectations based 
on random-phase initial conditions. Although the genus for the galaxy distribution 
has been compared in some detail with simulations based on different dark matter 
models [72], the same has not yet been done for clusters. This point will surely 
deserve future investigation. 

6. Conclusions 

As already anticipated in the Introduction, these Lectures should not be considered 
a comprehensive review about the application of advanced statistical methods in 
cosmology, for at least two reasons. Firstly, I gave only a partial view of the many 
techniques, which have been applied until now to characterize the galaxy clustering. 
My aim has been to introduce different statistical concepts which are able to pick 
up different characteristics of the large-scale structure (e.g., correlation properties, 
geometry and topology). However, one can well imagine other measures, like fractal 
scaling, filamentarity and percolation, which should be considered as complementary 
to those I described. Secondly, I dealt here with the characterization of the large-scale 
structure only in configuration space, while nothing has been said about the 
statistics of the velocity field traced by cosmic structures. This represents a relatively 
more recent field, which has undergone a progressive development during the last 
few years, thanks to the availability of more and more reliable redshift-independent 
measurements of galaxy distances (see refs.[22, 69] for recent reviews). 

However, even relying on the material I presented here, at least two firm points 
can be established. 

(a) Presently available samples of galaxies and galaxy clusters are already large enough to 
provide reliable clustering information, which goes beyond the 2-point correlation 
statistics. 

(b) Such measures are discriminatory, in the sense that they often allow one to distinguish 
between different dark matter models at a quite high confidence level. 

It is however clear that we are probably still far from having reached a satisfactory 
and self-consistent understanding of the formation and evolution of cosmic structures 
on the ground of the analysis of their distribution. Nevertheless, we are at a point at which 
we expect in the reasonably near future a better clarification of both theoretical and 
observational aspects concerning the large-scale structure. Hopefully this will allow 
us to further restrict the range of viable models or will lead to a radical change 
in our view of the Universe. From the theoretical side, a crucial point concerns a 
deeper understanding of the physics underlying galaxy formation. One's hope is to 
address this problem adequately with the availability of new numerical techniques 
and computing facilities, so as to make clear what we are comparing to what when 
analyzing numerical simulations and observational data. From the observational side, 
we are waiting for the advent of new huge galaxy redshift surveys (e.g., SDSS), as well 
as the compilation of new cluster samples (e.g., ROSAT). 

For these reasons, I believe that the study of the large-scale structure will remain in 
the following years an exciting field of investigation: the development and refinement 
of methods of statistical analysis will represent a necessary ingredient to clarify our 
view of the Universe. 

Appendix A. The functional derivative 

In order to introduce the concept of functional derivative of a given functional $F[\tau(\mathbf{x})]$ 
with respect to $\tau(\mathbf{x}) \in T$, let us consider a small function $\delta\tau(\mathbf{x}) \in T$, so that 
$\tau(\mathbf{x}) + \delta\tau(\mathbf{x})$ differs from $\tau(\mathbf{x})$ only in a neighbourhood of $\mathbf{x} = \mathbf{y}$. Moreover, let 

$$\delta\omega = \int d\mathbf{x}\, \delta\tau(\mathbf{x}) \qquad (A1)$$

be the volume element in $T$ contained between $\tau(\mathbf{x})$ and $\tau(\mathbf{x}) + \delta\tau(\mathbf{x})$. 
The functional derivative of $F[\tau]$ is defined as 

$$\frac{\delta F}{\delta\tau(\mathbf{y})} = \lim_{\delta\omega \to 0} \frac{F[\tau + \delta\tau] - F[\tau]}{\delta\omega} . \qquad (A2)$$
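
Taking $\delta\tau(\mathbf{x}) = \delta\omega\, \delta(\mathbf{x} - \mathbf{y})$, eq.(A2) becomes 

$$\frac{\delta F}{\delta\tau(\mathbf{y})} = \lim_{\delta\omega \to 0} \frac{F[\tau(\mathbf{x}) + \delta\omega\, \delta(\mathbf{x} - \mathbf{y})] - F[\tau(\mathbf{x})]}{\delta\omega} , \qquad (A3)$$

that, in the particular case $F[\tau(\mathbf{x})] = \tau(\mathbf{x})$, reads 

$$\frac{\delta\tau(\mathbf{x})}{\delta\tau(\mathbf{y})} = \delta(\mathbf{x} - \mathbf{y}) . \qquad (A4)$$

In a similar way, higher order functional derivatives can be introduced. 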
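As an example, let us consider the functional 

$$F_q[\tau] = \int d\mathbf{x}_1 \dots \int d\mathbf{x}_q\, f(\mathbf{x}_1,\dots,\mathbf{x}_q)\, \tau(\mathbf{x}_1) \dots \tau(\mathbf{x}_q) , \qquad (A5)$$

where $f(\mathbf{x}_1,\dots,\mathbf{x}_q)$ is a symmetric function with respect to the variables $\mathbf{x}_1,\dots,\mathbf{x}_q$. 
Differentiating the functional (A5), we get 

$$\frac{\delta F_q}{\delta\tau(\mathbf{y})} = \int d\mathbf{x}_2 \dots \int d\mathbf{x}_q\, f(\mathbf{y},\mathbf{x}_2,\dots,\mathbf{x}_q)\, \tau(\mathbf{x}_2) \dots \tau(\mathbf{x}_q) + \dots + \int d\mathbf{x}_1 \dots \int d\mathbf{x}_{q-1}\, f(\mathbf{x}_1,\dots,\mathbf{x}_{q-1},\mathbf{y})\, \tau(\mathbf{x}_1) \dots \tau(\mathbf{x}_{q-1}) . \qquad (A6)$$

Making use of the symmetry of $f(\mathbf{x}_1,\dots,\mathbf{x}_q)$ and relabeling the integration variables, 
we finally obtain 

$$\frac{\delta F_q}{\delta\tau(\mathbf{y})} = q \int d\mathbf{x}_1 \dots \int d\mathbf{x}_{q-1}\, f(\mathbf{x}_1,\dots,\mathbf{x}_{q-1},\mathbf{y})\, \tau(\mathbf{x}_1) \dots \tau(\mathbf{x}_{q-1}) . \qquad (A7)$$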
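
The result (A7) can be checked numerically for $q = 2$ on a one-dimensional grid. In the sketch below the integrals are replaced by sums with cell size `h`, and the Dirac delta of (A3) becomes a bump of height `d_omega / h` on a single cell; the grid, the kernel `f` and the function names are all illustrative choices.

```python
import math


def functional_f2(tau, f, h):
    """Discretized F_2[tau] = sum_ij f(x_i, x_j) tau_i tau_j h^2."""
    n = len(tau)
    return h * h * sum(f[i][j] * tau[i] * tau[j]
                       for i in range(n) for j in range(n))


def derivative_a7(tau, f, h, k):
    """Right-hand side of (A7) for q = 2, evaluated at y = x_k."""
    return 2.0 * h * sum(f[j][k] * tau[j] for j in range(len(tau)))


def derivative_numeric(tau, f, h, k, d_omega=1e-6):
    """Definition (A3): perturb tau by d_omega * delta(x - x_k)."""
    bumped = list(tau)
    bumped[k] += d_omega / h
    return (functional_f2(bumped, f, h) - functional_f2(tau, f, h)) / d_omega


h = 0.1
xs = [i * h for i in range(5)]
f = [[math.exp(-(a - b) ** 2) for b in xs] for a in xs]  # symmetric kernel
tau = [1.0 + a for a in xs]
print(derivative_numeric(tau, f, h, 2), derivative_a7(tau, f, h, 2))
```

The two printed numbers agree up to a term of order `d_omega`, which is the discrete counterpart of taking the limit $\delta\omega \to 0$.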

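Similarly, the higher order derivatives are 

$$\frac{\delta^n F_q}{\delta\tau(\mathbf{y}_1) \dots \delta\tau(\mathbf{y}_n)} = \frac{q!}{(q-n)!} \int d\mathbf{x}_1 \dots \int d\mathbf{x}_{q-n}\, f(\mathbf{x}_1,\dots,\mathbf{x}_{q-n},\mathbf{y}_1,\dots,\mathbf{y}_n)\, \tau(\mathbf{x}_1) \dots \tau(\mathbf{x}_{q-n}) \quad (n \le q) ,$$

$$\frac{\delta^n F_q}{\delta\tau(\mathbf{y}_1) \dots \delta\tau(\mathbf{y}_n)} = 0 \quad (n > q) . \qquad (A8)$$

As a further example, let us consider the exponential functional 

$$F[\tau] = \exp\left[ \int d\mathbf{x}\, f(\mathbf{x})\, \tau(\mathbf{x}) \right] , \qquad (A9)$$

whose $n$-th order derivative reads 

$$\frac{\delta^n F}{\delta\tau(\mathbf{y}_1) \dots \delta\tau(\mathbf{y}_n)} = f(\mathbf{y}_1) \dots f(\mathbf{y}_n)\, \exp\left[ \int d\mathbf{x}\, f(\mathbf{x})\, \tau(\mathbf{x}) \right] , \qquad (A10)$$

from which eq.(3) for the $n$-point correlation function follows. 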